Turkish PoS Tagging by Reducing Sparsity with Morpheme Tags in Small Datasets

نویسندگان

  • Burcu Can
  • Ahmet Üstün
  • Murathan Kurfali
چکیده

Sparsity is one of the major problems in natural language processing. The problem becomes even more severe in agglutinating languages that are highly prone to be inflected. We deal with sparsity in Turkish by adopting morphological features for part-of-speech tagging. We learn inflectional and derivational morpheme tags in Turkish by using conditional random fields (CRF) and we employ the morpheme tags in part-of-speech (PoS) tagging by using hidden Markov models (HMMs) to mitigate sparsity. Results show that using morpheme tags in PoS tagging helps alleviate the sparsity in emission probabilities. Our model outperforms other hidden Markov model based PoS tagging models for small training datasets in Turkish. We obtain an accuracy of 94.1% in morpheme tagging and 89.2% in PoS tagging on a 5K training dataset.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

بررسی مقایسه‌ای تأثیر برچسب‌زنی مقولات دستوری بر تجزیه در پردازش خودکار زبان فارسی

In this paper, the role of Part-of-Speech (POS) tagging for parsing in automatic processing of the Persian language is studied. To this end, the impact of the quality of POS tagging as well as the impact of the quantity of information available in the POS tags on parsing are studied. To reach the goals, three parsing scenarios are proposed and compared. In the first scenario, the parser assigns...

متن کامل

Joint PoS Tagging and Stemming for Agglutinative Languages

The number of word forms in agglutinative languages is theoretically infinite and this variety in word forms introduces sparsity in many natural language processing tasks. Part-of-speech tagging (PoS tagging) is one of these tasks that often suffers from sparsity. In this paper, we present an unsupervised Bayesian model using Hidden Markov Models (HMMs) for joint PoS tagging and stemming for ag...

متن کامل

Generalized unknown morpheme guessing for hybrid POS tagging of Korean

Most of errors in Korean morphological analysis and POS (Part-of-Speech) tagging are caused by unknown morphemes. This paper presents a generalized unknown morpheme handling method with P OSTAG (POStech TAGger) which is a statistical/rule based hybrid POS tagging system. The generalized unknown morpheme guessing is based on a combination of a morpheme pattern dictionary which encodes general le...

متن کامل

Syllable-Pattern-Based Unknown-Morpheme Segmentation and Estimation for Hybrid Part-of-Speech Tagging of Korean

Most errors in Korean morphological analysis and part-of-speech (POS) tagging are caused by unknown morphemes. This paper presents a syllable-pattern-based generalized unknownmorpheme-estimation method with POSTAG (POStech TAGger), which is a statistical and rule-based hybrid POS tagging system. This method of guessing unknown morphemes is based on a combination of a morpheme pattern dictionary...

متن کامل

Uzbek-English and Turkish-English Morpheme Alignment Corpora

Morphologically-rich languages pose problems for machine translation (MT) systems, including word-alignment errors, data sparsity and multiple affixes. Current alignment models at word-level do not distinguish words and morphemes, thus yielding low-quality alignment and subsequently affecting end translation quality. Models using morpheme-level alignment can reduce the vocabulary size of morpho...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2016